Methods of providing fast search, analysis, and data retrieval of encrypted data without decryption

ABSTRACT

Methods and systems of providing remote coded data storage, data analysis, and search and retrieval, with assurance of data security are described. Data security is such that it protects the data from any provider, administrator of remote services, or anyone breaking into the servers housing the data at the remote site. The methods include a coding schema such that both the storage and the associated services, such as data analysis, search and retrieval, can be provided even more efficiently and more responsively than without the coding. Possible applications of the methods include data storage and powerful data search and analysis services, which can all be provided “in the Cloud” over the Internet, completely securely, even when a customer's private data set needs to be uploaded to the remote site. The efficiency of analysis and search means that the methods may be useful even when security of data is not an issue.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of the filing date of U.S. Provisional Application No. 61/666,917, filed Jul. 1, 2012, entitled “Methods of Providing Fast Search, Analysis, and Data Retrieval of Encrypted Data Without Decryption,” the disclosure of which is incorporated by reference.

BACKGROUND OF THE INVENTION

The present invention relates generally to data security, and more particularly to data coding or encryption.

Providing Cloud services for private data searches, data mining, data extraction, and other data analysis tasks requires keeping the data private. The standard way to keep data private is to encrypt it. However, encrypted data has to be decrypted during searches, slowing the process of searching and data analysis, and providing an opportunity for a break-in compromising the data.

Several different methods of encrypting data and reducing the resulting hit on search performance have been proposed.

BRIEF SUMMARY OF THE INVENTION

One aspect of the invention provides a computer implemented method of providing secure data storage, the data comprised of data items which are comprised of data components, a plurality of the data components being comprised of a plurality of text characters, the method comprising: coding at least all data components needing secure storage such that each unique data component of a plurality of data components is assigned a unique code unrelated to the semantic meaning of the data component; storing the data using the coded data components; ensuring that decoding a coded data component is not needed to search for it; ensuring that to replace each code with a corresponding data component requires a table with at least as many code entries as there are codes used.

Another aspect of the invention provides a computer implemented method of providing data storage and search services, using a client-server system, in which a client computer is in a first location and a server computer is in a second location, the database comprised of data items which are comprised of data components, the method comprising: choosing a plural set of data components for coding in the first location; assigning a number code to each of the chosen data components in the first location; assigning identifiers to each of a plurality of data items in the first location; in the first location, creating a code table for converting each coded data component's assigned number code to the data component, such that the number code is arithmetically related to the number of the table row which contains the data component; and storing the number codes at the second location; wherein the code table is stored in a location other than the second location.

Another aspect of the invention provides a computer implemented method of providing access to data items in a collection of data items, using a client-server system in which a client is in a first location and a server is in a second location, the method comprising: in the first location identifying data components of data items, a plurality of data components comprising character strings consisting of more than two characters; in the first location assigning a number code to identify each of a plurality of data components and an identifier to each of a plurality of data items; in the first location creating a code table in which each row number is arithmetically related to the code of a data component and the corresponding table cell contains the data component or a reference to the data component; storing information indicative of the number codes in the second location; and in the second location performing a search of data items matching a Boolean query comprised of number codes of data components with the code table stored in a location other than the second location.

Another aspect of the invention provides a computer implemented method of coding data by assigning whole number codes to data components of data items, the method comprising: accepting input of a data component; comparing the data component to other data components that have already been coded; assigning a whole number code to the data component; storing the data component and its code; performing a search for a data component without decoding the data and without adding any performance overhead as compared with searches through uncoded data.

Another aspect of the invention provides a method of storing, searching and retrieving data such that the stored data is coded and remains coded at all times during the searching and retrieving, and the searching is performed faster compared to the search through the same uncoded data. In some such aspects the searches and retrieval of the coded data are performed at a first location and the retrieved data is decoded at a second location. In some such aspects a client computer program is located at the second location and a server computer program is located at the first location.

These and other aspects of the invention are more fully comprehended upon review of this disclosure.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram of a client-server arrangement in accordance with aspects of the invention.

FIG. 2 is a flow diagram of a process in accordance with aspects of the invention.

FIG. 3 is a flow diagram of a process in accordance with aspects of the invention.

FIG. 4 illustrates aspects of a further process in accordance with aspects of the invention.

FIG. 5 illustrates aspects of a still further process in accordance with aspects of the invention.

DETAILED DESCRIPTION

The methods described here replace each word, phrase or other data component in the data with a code, preferably a whole number code. Data moved to a server accessible over a network, for example the Internet, is entirely in terms of the codes, and the server may be considered a Cloud server. Search terms entered by a user or client software are converted to respective codes and the coded query is sent to the server. Searches are carried out by matching the codes of the query with the codes of the data, without the need for translation. This has the advantage of keeping the data private and an additional advantage that searches are more efficient, so responses are faster than if text were used instead of number codes.

A table, or multiple tables relating number codes to text, identifiers of data components, or their locations, is the key needed for decoding the results. This key can easily fit on a flash plug and can itself be password protected. Such an arrangement concentrates security control on the key, which can be kept under the data owner's control.

An example of a client server arrangement in accordance with aspects of the invention is illustrated in FIG. 1. A server 111 maintains data, with the data coded as numbers. A client 113 communicates with the server over a network, for example the Internet. The client sends queries to the server using the codes of data components, and receives responses to queries. The client may request data from the server, in which case the server returns the data coded as numbers. The client decodes the data using a key, which may be stored on a portable memory device 115, for example.

Coding of Data

In computers everything is a number. In particular, the atoms of text are characters and each character is coded as a whole number. This has been the practice since the start of computers.

In many applications, using today's computers, it is more convenient to code larger data components, such as commonly occurring text character combinations, punctuation marks, words, phrases, sentences, as well as every graphic, movie, music, sound, or other element. We will refer to these as data components. Each component can be coded as a whole number. Each identifiable attribute of a text or a non-text component can be similarly coded. In documents containing graphics, attributes of each graphic can be coded and/or the graphic itself can be assigned a number. Some of these codes can reference the component indirectly, for example by its location, while others can directly reference the component. Examples of attributes of graphics are graphic size, type (drawing, picture, color or black and white etc.), date created, and any other attributes that can be useful when searching, and similarly for other data components. Considering the example of FIG. 1, in many embodiments components, which in various embodiments include text character combinations and/or attributes of graphics, are coded as whole numbers, with the numbers, being considered number codes, stored in storage associated with the server 111. Communication between the server and the client 113 is of the number codes, and not the data components. A key, for example stored on a portable memory device such as a thumb drive 115, which is not located with or accessible by the server, relates the number codes to the data components and/or items.

Applications of Coding

Applications of this method of coding include data stores of all kinds, such as for example data records in a database or files on disk. A computer operating system can also use these methods to facilitate locating information in files and additionally provide security of data storage and protection against a break-in to the computer. Such protection can be possible if all communication of other computers with the operating system on the protected computer is similarly coded. Applications which create files, such as word processors, spreadsheets, etc. can also usefully use the methods described here to store the coded files on disk, so that the files are protected from anyone who does not have the decoding table.

Using the system to encrypt emails can provide secure email. For emails the initial coded vocabulary can be the initial key. This initial key can also include codes for every text character or more generally a character tuple. When a new word is used by any correspondent, these tuple codes can be used to code it. Thus the coded word can be sent to each correspondent and the word code can be added to the key on each user's computer. When two correspondents each need the same new word when corresponding with others, the same word can be assigned multiple codes.

There are several ways to handle such situations. One way is to use a server which manages coordination of the vocabulary. Each user's coded email application can automatically notify the server of any changes in the vocabulary and the server can update all users of the codes. Another method can arrange to send, in an email, all changes of the coding key that are new to a correspondent. All additions to the coding key can be performed entirely in terms of codes indicating the translation of each character code combination to a new code number.

We describe in detail the advantages of such whole number coding in the context of a database application as an example. Other applications should be evident from these descriptions. The preferred method is based on using whole number codes for every data component, so that each occurrence of a coded data component is replaced by its whole number code and all processing of the text is performed using these codes. An immediately clear advantage is that indexing of such data components is much more efficient. Another advantage is that access to a particular data component code to determine its meaning can be extremely fast if an array is used in such a way that the code number is the index of the array or is arithmetically related to the array index at which the value is the data component.
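By way of illustration only, the array lookup just described can be sketched in Python as follows. The sample vocabulary, the constant `CODE_OFFSET`, and the function name `decode` are assumptions made for this example and are not taken from the described embodiments.

```python
# Hypothetical reverse table: each data component sits at an array index
# that is arithmetically related to its whole number code.
CODE_OFFSET = 1000  # assumed relation: code = array index + CODE_OFFSET

reverse_table = ["patient", "fever", "cough", "2012"]


def decode(code: int) -> str:
    """Return the data component for a code using a single array access."""
    return reverse_table[code - CODE_OFFSET]


coded_text = [1000, 1001, 1002]
decoded = [decode(code) for code in coded_text]
```

No search or comparison of text is involved in the lookup; the cost of decoding a code is one subtraction and one array access.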

Although whole number codes are assumed in this description, any other codes, whether numbers or character combinations, can be used instead. For example, there may be occasions where decimal numbers can be conveniently used. The whole number portion of the decimal can be related to the array index and the decimal portion can be used to code other dimensions of the components, such as the type of component.

One way of coding every data component is to associate the code directly with the data component, when the data component is textual, or with a reference to the location of the data component when it is non-textual. The assignment can use formulas or can be arbitrary. The use of formulas however is not as safe as when it is arbitrary and requires a table of codes. The safest coding table is one in which the only relationship between the codes and the data components is defined by the table and the order of assignment can be arranged to be totally arbitrary, or random. Such a table can be considered the “key to the data” just like a single password or code, but one that is much harder, or even impossible to break.

The frequency of occurrence of data components within the data may be a target of code breaking attempts if the subject domain of the data is known in some detail. To avoid this possibility, the coding can use multiple codes for each of the more frequently used data components, thus obscuring the true frequencies of use.
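The following Python sketch illustrates, under stated assumptions, an arbitrary (random) assignment of codes in which frequently used data components receive several codes. The function names, the frequency threshold, the number of codes per frequent component, and the use of a dictionary rather than an array for the reverse table are all choices made for this example only.

```python
import secrets
from collections import Counter


def build_tables(tokens, codes_per_frequent=3, frequent_threshold=3,
                 code_space=2**32):
    """Assign random codes; frequent components get several codes each."""
    counts = Counter(tokens)
    forward = {}  # data component -> list of its codes
    reverse = {}  # code -> data component
    for component, count in counts.items():
        wanted = codes_per_frequent if count >= frequent_threshold else 1
        codes = []
        while len(codes) < wanted:
            code = secrets.randbelow(code_space)
            if code not in reverse:  # keep every code unique
                reverse[code] = component
                codes.append(code)
        forward[component] = codes
    return forward, reverse


def encode(tokens, forward):
    """Replace each component by one of its codes, chosen at random."""
    return [secrets.choice(forward[token]) for token in tokens]
```

With such a table, a query for a component that has several codes would presumably be expressed as a disjunction of those codes, so that all occurrences are still found.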

Data can be stored, quite securely, encrypted using any of the well-known encryption methods. However, the decryption during search provides an opportunity for “spying eyes” to compromise security and the decryption time slows down performance. Several attempts at overcoming this problem have been made. One recent such method called CryptDB claims an overhead of between 14.5% and 26% compared to the unencrypted case, which is a substantial cost in performance or throughput. This is claimed as a great improvement over anything previously available. Generally, the methods described here not only do not add an overhead, they greatly enhance performance, effectively reducing query response times.

Coding Vs Encrypting

We use the term “coding” of data components to mean replacing the data components by codes, each of which contains no information related to the data component, other than the relation provided by the key.

Encrypting is usually understood to mean the coding of all data components using a single schema based on a single password of reasonable length and based on the textual content of each part of text. This means the result of encryption contains within it the information contained in the encrypted text. In contrast the number coding schema described here should be seen as a text independent encryption of each data component. This means that the codes do not contain within them any information present in the text they represent. The connection to the meaning of each code is outside and independent of the code itself, and is provided by the key. That information resides in the coding table (or key) alone. Such a schema can make the breaking of the codes either totally impractical, or impossible, even when a very large corpus of coded data is stolen.

Database Example

If the text data in a database is stored using a whole number code for each word, phrase, punctuation mark, special symbol and/or sentence, it can increase the speed of searching and provide safe coding and storage of the data. Normal text searches find matching words by checking the match of each text character in each word. Searching for a whole number code representing a word needs only to check the match of one number. In English text, for example, the average word length is about 5 characters, each of which, in two-byte Unicode, uses 2 bytes, so an average word occupies about 10 bytes. In most databases the total number of needed data components, mostly words, will certainly not exceed the limit of a 4 byte number (4,294,967,295). Searching for each word is then equivalent to finding a match to a 4 byte number. That alone means a saving in search time.

However, a much greater time saving can be achieved because the search Boolean can be expressed in terms of whole number codes. As a user enters a search term, it is converted to its number code, which also confirms that it exists in the code table, and the code is used in the query, thereby expressing the query in terms of number codes and Booleans. Fast searches for data items containing data components involve locating a whole number code corresponding to a data component and can be carried out using a location index table, also called an association matrix, storing all the IDs of data items containing each data component. One way of storing an association matrix uses a bitmap, or array of bit vectors. Each row number of the matrix is the ID of a data component and each column number is the ID of a data record or a data item, which could be a join of records. In this bitmap, or binary version of the association matrix, a 1 in a cell in row R and column C represents the presence of the data component with ID number R in the data item with ID number C, that is, an association between data component R and data item C. Conversely, a zero in that cell represents no association between the respective parts. Such a binary storage arrangement is quite often a very sparse matrix, in which the great majority of the cells are zeros. This means that its storage space requirements are much larger than they need be. Table 1 illustrates a bitmap representation of the association matrix with a simple example. Each row in such a table is of fixed length and the row is represented by a data component bit vector, with the number of bits equal to the total number of data items.

To take advantage of this sparseness, some or all vectors of an association matrix can often be more optimally stored as an array of data component vectors, where each vector component is the ID of a data item associated with the data component. This representation of the association matrix is referred to as the ID Vector representation. Table 2 shows the example from Table 1, represented as an array of ID Vectors or Rows, each of which in this kind of table can have a different length, the length of each data component vector being the count of data items associated with the data component. In estimating storage sizes, we use the average number of data items per data component.

In both the bitmap and the ID Vector representations, locating a target whole number code does not need a search for any matches, because the target whole number code is the array index in the association matrix. This array index is related by the compiler simply arithmetically to the address of the respective row vector. The components of each row vector in this association matrix are the identifiers of all the data items containing or associated with the respective data component.

TABLE 1
Association matrix bitmap representation

Row number = array index      Column number = ID of each data item
= ID of data component        1    2    3    4    5    6
0                             1    1    1    1    0    0
1                             0    0    1    0    1    0
2                             0    0    1    1    0    1
3                             1    1    0    0    1    0

TABLE 2
Association matrix ID vector representation

Row number = array index      IDs of associated data items
= ID of data component
0                             1    3    4    2
1                             5    3
2                             6    4    3
3                             1    2    5
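As an illustrative sketch only, the two representations of Tables 1 and 2 can be written in Python as below. The variable and function names are assumptions for this example; the data values are those of Table 1.

```python
# Table 1 as an array of bit vectors: the row index is the data component ID
# and column position j (zero-based) stands for data item ID j + 1.
bitmap = [
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
]


def to_id_vectors(rows):
    """Convert each bit vector to the list of data item IDs whose bit is set."""
    return [[col + 1 for col, bit in enumerate(row) if bit] for row in rows]


id_vectors = to_id_vectors(bitmap)

# Locating the vector for data component 2 is a plain array access.
items_with_component_2 = id_vectors[2]
```

The resulting vectors hold the same item IDs as the rows of Table 2, although Table 2 lists them in a different order within each row.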

The storage space required for such a matrix table, even for very large databases, is quite easily accommodated in current computers. For example the number of unique data components in a large database, of mostly text data items, can be as large as 10 million. If the average number of data items associated with a data component is about 500, the space required to store the matrix can be about 20 GB, assuming that we use 4 bytes to store each ID. Such matrices can be held on disk, or in RAM. Disk access can be very fast because searching for matching items needs to access only one vector for each data component comprising a query. For even faster access, RAM (or flash RAM) can accommodate such indexes.
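The 20 GB estimate follows directly from the stated figures, as the short Python calculation below shows. The variable names are chosen for this example, and a gigabyte is taken here to be 10^9 bytes.

```python
unique_components = 10_000_000   # unique data components in the database
avg_items_per_component = 500    # average data items per data component
bytes_per_id = 4                 # each data item ID stored in 4 bytes

matrix_bytes = unique_components * avg_items_per_component * bytes_per_id
matrix_gb = matrix_bytes / 10**9
```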

The most efficient structure for the association matrix can use an array of vectors of two types. Sparse vectors can be stored as ID vectors while the more dense ones can be stored as bit or binary vectors. For example, if 32 bit numbers are used for IDs, then when the density of set bits in a bit vector representation is greater than or equal to 1 in 32, the bitmap representation can be used, since it then takes no more space than the list of IDs. For lower densities, the ID vector representation can be used. The first element of each vector can represent its type.
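A minimal Python sketch of choosing between the two vector types is given below. It assumes that an ID costs 32 bits and that a bit vector costs one bit per data item, so the ID vector is the smaller form only when fewer than 1 in 32 bits would be set. The function name and the type tags "ids" and "bits" are assumptions for this example.

```python
ID_BITS = 32  # assumed size of one data item ID


def choose_representation(item_ids, total_items):
    """Return (type_tag, vector), using whichever form needs fewer bits.

    The first element of the result records the vector type.
    """
    id_vector_bits = len(item_ids) * ID_BITS
    bit_vector_bits = total_items
    if id_vector_bits < bit_vector_bits:
        return ("ids", sorted(item_ids))
    bits = [0] * total_items
    for item_id in item_ids:
        bits[item_id - 1] = 1  # data item IDs start at 1
    return ("bits", bits)
```

For instance, a data component found in 2 of 1000 data items would be kept as an ID vector, while one found in 2 of only 6 data items would be kept as a bit vector.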

Remote Data Storage

There are additional advantages of such data component indexing, particularly when searches are to be carried out on a remote computer. An important consideration in today's markets is the security of data storage “in the Cloud,” or stored in storage available over a network, generally the Internet. We will refer to this and other similar arrangements as remotely stored data.

Although security of such remotely stored data is improving, breaches have occurred in the past and will occur in the future despite best efforts. Coding the data components and then, for remote data storage, storing all data only in terms of the whole number codes can provide better security. Anyone breaking into the data storage location will not be able to extract any information from the data without having access to the meaning of each code, that is, access to the code table.

The coding table stores the meaning of every whole number code. This is all that needs to be secured and can be stored on the client side, not on the remote computer. It is then up to the user to provide security for the coding table. In case of a break-in to the remote computer, the codes are of no use without the coding table. The coding table is quite small in comparison with the data and so can be stored on a flash plug which can be stored in a safe. Communication between client and server, often carried out over the Internet, can be entirely in terms of the code numbers and so the method also protects against any compromised security of the communication system.

A coding table, which we will call the key, for a large database comprised, for example, of documents, may contain as many as 10 million coded data components. If the average length of a data component is assumed to be 10 two-byte characters, the space needed for the key, uncompressed, is 200 MB, easily storable on a flash disk. Of course the key can be stored encrypted and compressed using standard encryption and compression methods.

Transactions and Search Methods

The use of whole number codes has many applications. We describe an application to the storage of data and the searching of a database. By a database we mean a collection of any data items, such as any document files, records or their joins, which in general we call items. These coding methods can be used in a traditional database, or in a TIE implementation, or in other faceted navigation implemented databases.

The use of whole number codes of data components allows us to perform all editing, searching and other transactions entirely in terms of the whole number codes. It is only when a user wishes to see any of the data components that a conversion from whole number codes to data components is made. The number of data components that are decoded at one time, for display to the user, is therefore limited to those that can be usefully examined at one time. This number is quite small, which means that their lookup can be very fast, even if the necessary indexes are stored on disk.

For optimum lookup speed, a single index table, during initialization of a client, can be converted to two tables. One, the forward table, can provide a fast lookup of a text data component to obtain its whole number code, and can be stored as a hash table, keyed on the data component, or a reference to it. This is used when a new entry is made, during the coding of data, when new data is added, and when a user enters search terms. The other, called the reverse table, can provide a fast lookup of a whole number code to obtain the data component. That can be stored as an array of text data components, where the array index is the ID of the data component. Such an array can be disk based, or stored in RAM.

For fast location of data component IDs, the forward table can be kept in RAM or on disk. In this forward table, a data component (or a reference to it) can be the hash key and the value can be the ID assigned to that data component. Each added data component can be checked for its presence in the forward table. If it is already present then its ID can be used. If it is not present then the next available ID is used and the new entry is added to both the forward and the reverse tables.
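The forward and reverse tables described above can be sketched in Python as follows. This is a minimal in-memory illustration; the class name `CodeTable` and its method names are assumptions for this example, and sequential assignment of IDs is only one of the possible assignment orders.

```python
class CodeTable:
    """Forward table (hash table) plus reverse table (array)."""

    def __init__(self):
        self.forward = {}  # data component -> ID (hash lookup)
        self.reverse = []  # array index = ID -> data component

    def code_for(self, component):
        """Return the ID of a component, assigning the next free ID if new."""
        code = self.forward.get(component)
        if code is None:
            code = len(self.reverse)          # next available ID
            self.forward[component] = code    # add to forward table
            self.reverse.append(component)    # add to reverse table
        return code

    def component_for(self, code):
        """Decode an ID with a single array access."""
        return self.reverse[code]


table = CodeTable()
coded = [table.code_for(word) for word in ["fever", "cough", "fever"]]
```

The repeated component "fever" receives the same ID both times, so the two tables grow only when a component is seen for the first time.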

It is often desirable to convert every data component of each data item to a number code. All coded data can then be stored in any convenient location with assurance of security. When data is to be retrieved, the reverse table can be used to convert each whole number code to a representation suitable for transmitting to a calling program or presenting to a user. Without access to this coding table, the meaning of the stored whole number codes can be unavailable and so storage of such coded data need not be secure.

An alternative to coding every data component is to not code certain components, or certain data items. For example the punctuation marks and special symbols, such as :, ;, ., %, $, !, ?, and other symbols can be used without being coded. However when high security is needed, this approach might provide certain clues to a code breaker, no matter how weak those clues may be. Security can be further enhanced if all field names are also coded.

An additional advantage of coding every data component is that the data storage is easier. For example, data stored using the number codes requires some separator between the numbers, such as a space or comma, and any space or comma needed in the data can require an additional space. This makes parsing of the coded data more awkward. So it is easier if a literal space in the data can be represented using a number code. Similarly, if a comma is needed it can be coded. We can then store the data using any character (except a digit of course) as the separator between number codes.
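The following Python sketch illustrates why coding literal spaces and commas simplifies storage: once they have codes of their own, any non-digit character can separate the number codes. The small code table, the tokenizing rule, and the function names are assumptions made for this example.

```python
import re

# Hypothetical code table in which the comma and the space are coded too.
forward = {"Smith": 0, ",": 1, " ": 2, "John": 3}
reverse = {code: component for component, code in forward.items()}


def tokenize(text):
    """Split text into words and single non-word characters."""
    return re.findall(r"\w+|\W", text)


def encode(text, separator="|"):
    """Store text as number codes joined by a non-digit separator."""
    return separator.join(str(forward[token]) for token in tokenize(text))


def decode(coded):
    """Split on any non-digit character and look each code up."""
    return "".join(reverse[int(number)] for number in re.split(r"\D", coded))


stored = encode("Smith, John")
```

Because the separator carries no meaning of its own, the same stored codes decode identically whichever non-digit separators are used.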

The client-server architecture can be arranged in such a way that the server performs all its tasks on the data using only the number codes, while everything the client uses to display to the user, or to use in calculations, is on the client's computer, in the translated data component form, using the reverse table. Without compromising data security, the coding table (both forward and reverse) can be in the same location as the client, even when the server is at some remote and less secure location.

Using this system, data storage services can be provided “in the Cloud,” e.g. at one or more remote locations accessible over the Internet, with assurance that even the provider of such a service will not be able to see the real data. Some of those providing such services can also provide tools for searching through the data, without any of the real data being revealed to the provider, or to anyone having access to the service without the key.

Offering such remote data storage services can be implemented by providing the user a web location to obtain a download of the coding software application, referred to here as the Coder. Such a Coder can then allow the user to choose which data sets they want to store remotely. The Coder can analyze the data on customers' secure computers, creating the coding table and a coded copy of the data. It can then upload the coded data to the remote location.

When a user desires to perform a transaction or to search, the necessary software tools can be provided from the remote site. To perform any action on the remote data, the coding table can be required on the client side. That coding table, as already explained, can be used to convert any entered data components into codes and any codes returned by the server into data components.

The following are examples of how various client controlled processes can work. In a first example, discussed with respect to FIG. 2, data is added. In a second example, discussed with respect to FIG. 3, data is searched. In some embodiments the processes of FIGS. 2 and 3 may be performed by a client computer in communication with a server, which may also perform portions of the processes.

Transactions

A user enters new data in block 211. The data may be entered into a client computer, for example.

1. The entered data is converted into data components in block 213, for example by the client computer.
2. The data components are coded using the forward table in block 215, for example by the client computer. Those data components already assigned number codes use those codes. New data components are assigned new number codes, in sequence in some embodiments, which are added to the reverse table and the forward table. Their effect on the association matrix is also updated.
3. The new data and any changes to the association matrix are sent to the server using codes in place of each data component in block 217.
4. The server updates its matrix and anything else affected by the added data in block 219.
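The transaction steps above can be sketched in Python as follows. This is a simplified in-memory illustration, not the described implementation: the class names `Server` and `Client` are assumptions, data components are taken to be whitespace-separated words, and the server here derives its association matrix from the codes it receives rather than receiving matrix changes from the client.

```python
class Server:
    """Stand-in for the remote server: it sees only codes and item IDs."""

    def __init__(self):
        self.items = {}        # data item ID -> list of number codes
        self.association = {}  # number code -> set of data item IDs

    def add_item(self, item_id, codes):      # block 219
        self.items[item_id] = codes
        for code in codes:
            self.association.setdefault(code, set()).add(item_id)


class Client:
    """Holds the forward and reverse tables; they never leave the client."""

    def __init__(self, server):
        self.server = server
        self.forward = {}
        self.reverse = []
        self.next_item_id = 1

    def add(self, text):
        components = text.split()             # block 213
        codes = []
        for component in components:          # block 215
            if component not in self.forward:
                self.forward[component] = len(self.reverse)
                self.reverse.append(component)
            codes.append(self.forward[component])
        item_id = self.next_item_id
        self.next_item_id += 1
        self.server.add_item(item_id, codes)  # block 217
        return item_id


server = Server()
client = Client(server)
client.add("fever cough")
client.add("cough rash")
```

After these two additions the server holds only lists of numbers and item IDs, while the words themselves exist only in the client's tables.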

Searches

A user inputs a search in block 311. The search terms of the search may be entered into a client computer, for example.

1. If the user inputs a data component, the entry is checked using the forward table in block 313, for example by the client computer, to make sure it matches one of the data components. Matches can be checked after each character is added to the search input, or after the input is completed. Auto-completion can be used if needed. When or if the user chooses a data item from those presented, its number code can be directly associated with the presented and chosen data item. User inputs that do not match any of the data components can be rejected, implicitly, by not accepting the typed mismatched characters, or explicitly by notifying the user, or both.
2. The data components in the query are converted to number codes from the forward table in block 315, for example by the client computer.
3. The query, comprised of number codes and when necessary (or in some embodiments when appropriate), Boolean operators, is sent to the server for evaluation in block 317. For added security, Boolean operators can also be coded.
4. The server responds using number codes and item identifiers which may also be in terms of number codes, and the client computer receives the response in block 319.
5. In the response, the client computer converts the number codes of elements, used for presentation to the user, to their corresponding data components, using the reverse table in block 321 and presents the response to the user in block 323.
6. If the user requests to view found items, the item codes are converted to item references, which often comprise file location and offsets into the file, and the requested data items are presented to the user.
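The client-side portions of the search steps above can be sketched in Python as below. The small code table and the function names are assumptions for this example; explicit rejection of unknown terms is shown by raising an error, which is only one of the options described.

```python
# Hypothetical client-side tables.
forward = {"fever": 0, "cough": 1, "rash": 2}
reverse = ["fever", "cough", "rash"]


def complete(prefix):
    """Block 313: offer the data components that match the typed prefix."""
    return sorted(c for c in forward if c.startswith(prefix))


def code_query(terms):
    """Blocks 313 and 315: reject unknown terms, convert the rest to codes."""
    unknown = [term for term in terms if term not in forward]
    if unknown:
        raise ValueError(f"not in code table: {unknown}")
    return [forward[term] for term in terms]


def decode_response(codes):
    """Block 321: convert returned number codes back to data components."""
    return [reverse[code] for code in codes]
```

Only the output of `code_query` would be sent to the server, and only number codes would come back for `decode_response` to translate.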

The searches on the server can be made very fast by creating an association table as an array of data component vectors, in which each vector is identified by the table row number, which is the number code of the corresponding data component. The components of each data component vector are the identifiers of all data items containing that data component or described by it. Boolean queries comprised of data component codes can then be easily and efficiently evaluated as unions, intersections, or complements of the sets of data item identifiers held in the relevant data component vectors.
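By way of a hedged illustration, server-side evaluation of a Boolean query as set operations can be sketched in Python as follows. The nested-tuple query format, the operator names, and the function name `evaluate` are assumptions for this example; the association data is that of Tables 1 and 2.

```python
ALL_ITEMS = {1, 2, 3, 4, 5, 6}

# Association table: number code of data component -> set of data item IDs.
association = {
    0: {1, 2, 3, 4},
    1: {3, 5},
    2: {3, 4, 6},
    3: {1, 2, 5},
}


def evaluate(query):
    """Evaluate a query given as a code or as (operator, operand, ...)."""
    if isinstance(query, int):
        return association.get(query, set())
    operator, *operands = query
    sets = [evaluate(operand) for operand in operands]
    if operator == "and":
        return set.intersection(*sets)
    if operator == "or":
        return set.union(*sets)
    if operator == "not":
        return ALL_ITEMS - sets[0]
    raise ValueError(f"unknown operator: {operator}")
```

For example, `evaluate(("and", 0, ("not", 2)))` finds the data items that contain component 0 but not component 2, without the server ever knowing what either code stands for.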

The server and client arrangement, in various embodiments, can be any functioning database system which can be set up to manage whole number codes instead of the uncoded data components. There is generally no reason why any data management system would not be able to use number codes in place of data components. Generally, the only time that these number codes need translating to the corresponding data components is when the results need to be presented to another application or to a user, or when the specific values of the data components are needed. In essence, after the conversion to number codes, the coded data can be stored in any database and searched in any available way.

When calculations that would be performed on the server side involve the values of the data components, the calculations can instead be performed on the client, where decoding is possible. If the calculations would be performed on the server because they are part of the query and so determine which items match the query, they can in most cases still be performed on the client as follows.

Suppose the query comprises any condition C(S) involving a set of data components S. Then the query can be evaluated with the condition requiring evaluation replaced by the requirement that all data components needed for the calculation are present in every data item. In addition, the query response can require the list of identifiers of all the data entities in the set S. The client is then able to apply the condition C(S) and send a modified query to the server using the results of the evaluation. The evaluation can limit the set S of data components to only those that satisfied the criterion C(S), and the modified query can use that to determine the matching items. For most types of C(S), however, the modified query is not even necessary and a much simpler method can be used.

For example, suppose in a healthcare database C(S) is the condition to find all encounters in which the patient is in a specified age range and had specified symptoms. The client has all the field values as part of the key, so it has all the ages in the age fields of all the records. Therefore the client can create a coded query, with the actual ages that meet the specified range in a parenthesized disjunctive subset, combined by conjunction with the specified symptoms.
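The age-range example can be sketched as below. The forward table keyed by (field, value) pairs, the code values, and the dictionary layout of the coded query are all assumptions made for illustration; the disclosure does not prescribe a query format.

```python
# Hypothetical forward table keyed by (field, value) pairs.
forward_table = {
    ("age", 29): 10, ("age", 34): 11, ("age", 41): 12, ("age", 67): 13,
    ("symptom", "fever"): 20, ("symptom", "cough"): 21,
}


def age_range_query(low, high, symptoms):
    """Build a coded query: (age codes OR-ed together) AND every symptom code.

    The range test runs on the client, which holds the uncoded ages in
    its key; only number codes appear in the resulting query.
    """
    age_codes = sorted(
        code
        for (field, value), code in forward_table.items()
        if field == "age" and low <= value <= high
    )
    symptom_codes = [forward_table[("symptom", s)] for s in symptoms]
    return {"any_of": age_codes, "all_of": symptom_codes}


query = age_range_query(30, 50, ["fever", "cough"])
```

Here the ages 34 and 41 fall in the range, so the disjunctive part of the query holds their two codes and the server never sees the range itself.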

Methods Of Query Execution

When a relational database is used, the database's query execution can be used. However, when number codes are used for all data components, it is much more efficient to take advantage of this and use more optimal methods for query execution. One such method creates association matrices storing the association of each record with its field values. When the field values are all whole numbers, these can be used in a table as the row number, while the column number can represent the ID of a record. We refer to this table as the matrix. Its two common implementations are as an array of vectors, where each vector is either an array of bits or an array of the ID numbers of the non-zero bits.

Using the association matrices, the methods of executing queries can be optimized as follows. We usually store an association matrix as an array of vectors, one per row, whose components are the column numbers of the non-zero cells in the corresponding bit vector. Assuming the use of 32 bit IDs, storing a matrix as a bitmap is more compact only when the matrix is denser than one non-zero bit in 32. However, when executing a query it is often better for performance to convert the vectors being used in the query evaluation process to bit vectors. The following explains one optimal set of method steps.
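The one-in-32 density threshold follows from simple arithmetic, sketched below for a single matrix row. The function names are illustrative assumptions; the 32 bit ID width is the one assumed in the text.

```python
ID_BITS = 32  # width of one stored item ID, as assumed in the text


def id_list_bits(num_nonzero):
    """Storage for a row kept as a list of IDs of its non-zero cells."""
    return ID_BITS * num_nonzero


def bitmap_bits(num_columns):
    """Storage for a row kept as a bit vector, one bit per column."""
    return num_columns


def more_compact_form(num_columns, num_nonzero):
    """Name the smaller representation for one row."""
    if bitmap_bits(num_columns) < id_list_bits(num_nonzero):
        return "bitmap"
    return "id list"
```

For a row with 3200 columns, 50 non-zero cells need 1600 bits as an ID list against 3200 bits as a bitmap, while 200 non-zero cells need 6400 bits as an ID list, so the bitmap becomes the smaller form.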

-   1. A query typically consists of a set of data components and a set of Boolean operators. The evaluation of such Booleans, in the simplest cases, involves unions and intersections of the vector components of data component vectors, where each component is the ID of a record. So, for example, the conjunctive Boolean between data component A and data component B is evaluated, and the vector components of the result vector C are the IDs of the matching items.
-   2. The result vector is then conjoined (or disjoined) with the next data component vector, if any, in the Boolean and the process proceeds in that way.

Next we describe some optimal methods of evaluating the conjunction and the disjunction between two vectors.

Vector Conjunctions and Disjunctions

When the two vectors have components which are sorted indexes of the non-zero bits in the corresponding bit vector form, the common method of evaluating their conjunction or disjunction is the well-known zig-zag method. However, we describe here a method that is faster in performance and does not require the vector components to be sorted.

-   Conjunctions

Let the two vectors to be conjunctively combined be A and B, both represented in ID component form. The process is described in terms of A and B, but both of these are replaced after each step in an iterative process. The process, an example of which is shown in FIG. 4, in illustrative form for a single B vector, is as follows:

-   1. Assign a first data component vector (or more generally query result vector) to be vector A and a next data component vector to be vector B;
-   2. Convert B to a bit vector by using each component ID of the ID vector to address the corresponding bit index of a bit vector and setting it to 1;
-   3. Iterate through the ID components of vector A using each vector component as the index into the bit vector and if that bit component is not a 1, remove the component from vector A;
-   4. The modified or temporary result vector A is then used with the next vector assigned to B, to be conjoined with vector A and the process repeated from step 2 until all conjunctions are completed.
-   5. The resulting modified vector A is the result vector, whose components are the IDs of the matching items.
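A minimal sketch of the conjunction steps above follows. The function name `conjoin` and the parameter `num_items` (one more than the largest item ID) are assumptions for the example, and a `bytearray` with one byte per flag stands in for a packed bit vector to keep the code short.

```python
def conjoin(vectors, num_items):
    """Intersect ID vectors using a bit vector; the IDs need not be sorted."""
    result = list(vectors[0])              # step 1: vector A
    for b in vectors[1:]:                  # step 4: repeat for each next B
        bits = bytearray(num_items)        # all flags start at 0
        for item_id in b:                  # step 2: convert B to a bit vector
            bits[item_id] = 1
        # step 3: keep only the components of A whose flag is set in B
        result = [item_id for item_id in result if bits[item_id]]
    return result                          # step 5: IDs of the matching items


matches = conjoin([[7, 2, 5, 9], [5, 1, 7], [7, 5, 3]], num_items=10)
```

Each pass touches every component of B once and every surviving component of A once, without any sorting or comparison of IDs.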

Usually the conjunctions of only a small number of data components are needed. After every additional conjoined data component the number of vector components of the resulting vector gets smaller; therefore the zig-zag method can be quite satisfactory in performance. However, when the number of data components to be conjoined is large, the method described can improve the performance considerably.

Disjunctions

A similar method is used to evaluate the disjunction of a set of vectors. The optimized process for the disjunction of two vectors A and B, an example of which is illustrated in FIG. 3, is as follows:

-   1. Assign a first data component vector to vector A and a next vector to vector B;
-   2. Convert A to a bit vector by using each component ID of the ID vector to address the corresponding bit index of the bit vector and setting it to 1;
-   3. Iterate through the ID components of vector B using each component as the index of the A bit vector and setting it to 1;
-   4. Modified bit vector A is then used as the result vector and disjunctively combined with the next vector assigned to B and the process repeated from step 3 until all disjunctions are completed.
-   5. The resulting modified vector A is the result vector, whose component bits designate the IDs of the union set of the components of all the disjoined vectors.
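The disjunction steps above can be sketched in the same style. The names `disjoin`, `ids_of`, and `num_items` are assumptions for the example, and a `bytearray` again stands in for a packed bit vector.

```python
def disjoin(vectors, num_items):
    """Union ID vectors into one bit vector; the IDs need not be sorted."""
    bits = bytearray(num_items)      # bit vector A, all flags 0
    for item_id in vectors[0]:       # step 2: convert A to a bit vector
        bits[item_id] = 1
    for b in vectors[1:]:            # step 4: repeat for each next B
        for item_id in b:            # step 3: set the flag for each ID in B
            bits[item_id] = 1
    return bits                      # step 5: set flags designate the union


def ids_of(bits):
    """List the item IDs whose flag is set."""
    return [item_id for item_id, flag in enumerate(bits) if flag]


union_ids = ids_of(disjoin([[4, 1], [1, 6], [0]], num_items=8))
```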

Finally we describe the counting process, the steps that result in the counts of all items associated with each data component. These counts we call frequencies.

Once the set of matching items is determined, the items-to-data-component matrix may be used to determine the frequencies. The process steps are very similar to the disjunction steps just described, but instead of using a bit vector for the output vector (vector A) we use an array of counts (more simply referred to as the counting vector) for vector A. This can be an array of integers, each integer large enough to store the largest count of items, with the array long enough to store the counts of associated items for all the data components whose frequencies are needed. Each array index is made the ID of a data component, which allows each counting element to be addressed just like a bit of a bit vector. The steps are the following:

-   1. Create the counting vector array A, initialize it to an appropriate size and set all counts to zero;
-   2. Use the components of the next item vector as indexes into the counting array and at each addressed index increment the count;
-   3. Repeat step 2 until all item vectors matched by the current query have been processed;
-   4. The resulting counting vector A contains the counts of the matching items associated with every data component. Those with zero counts can be made unavailable for conjunctive additions to queries.
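The counting steps above can be sketched as follows, assuming each matching item is available as a vector of the number codes of its data components. The names `count_frequencies` and `num_components` are illustrative.

```python
def count_frequencies(item_vectors, num_components):
    """Count, for each data component code, the matching items containing it."""
    counts = [0] * num_components    # step 1: counting vector, all zero
    for item in item_vectors:        # step 3: every matched item vector
        for code in item:            # step 2: the code is the array index
            counts[code] += 1
    return counts                    # step 4: the frequencies


frequencies = count_frequencies([[0, 2], [2, 3], [2]], num_components=5)
# Codes with a zero count could be made unavailable for conjunctive additions.
available = [code for code, n in enumerate(frequencies) if n > 0]
```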

Additional Data Security

Words, under any convenient definition of the meaning of “word”, are the most common data components. Assigning consecutive whole number codes to words and various additional symbols (such as currency signs, percent signs, punctuation marks, etc.) in some systematic order can provide some clues to a very determined adversary intending to break through the coding. One way to make that more difficult is to assign consecutive whole number codes to a randomized ordering of the data components. This can make it impractically difficult to discover the coding. However, a further step can be taken to make it even more difficult to break the coding.

If adversaries know the nature of the data, they may be able to analyze the usage frequency of the codes and compare them with the usage frequency of words in similar data. To foil any such attempts, the true usage frequency of the codes can be disguised by using multiple different whole number codes for each of the more frequently used words, phrases or other frequent data components. A different one of the several whole number codes assigned to such a frequent word can then be used for each instance of the word in any data passed to or from the server, or for any data resident on the server.

The following is one way of achieving such a frequency disguise. Sort the complete unique word vocabulary by the frequency of each word's use within the database. Then the words with the highest frequency can be assigned the largest number of different whole number codes, while the ones in the lower frequency groups can be assigned a smaller number of whole number codes. To even out the frequency of occurrence, the number of codes to be assigned to a frequent data component can be made approximately proportional to that component's frequency of occurrence in the data.

Table 3 is an example table of a small sample of relatively few words, their average occurrence frequencies per item and a possible choice for the number of whole number codes to be assigned to each.

TABLE 3

Frequencies     Relative       Number of IDs
per document    frequencies    to be assigned    Word
922             15.9           16                Of
842             14.5           15                the
518             8.9            9                 and
364             6.3            6                 in
345             5.9            6                 is
337             5.8            6                 a
337             5.8            6                 to
235             4.1            4                 that
161             2.8            3                 are
158             2.7            3                 p
149             2.6            3                 as
113             1.9            2                 be
112             1.9            2                 with
111             1.9            2                 memes
108             1.9            2                 this
100             1.7            2                 for
97              1.7            2                 or
95              1.6            2                 can
92              1.6            2                 by
86              1.5            1                 one
78              1.3            1                 it
71              1.2            1                 we
70              1.2            1                 which
69              1.2            1                 knowledge
67              1.2            1                 an
65              1.1            1                 information
64              1.1            1                 cultural
58              1              1                 on

In this example, the remaining words occurring less frequently than the last word listed in the table can be assigned single whole number codes.

Almost any number of IDs assigned to the frequently occurring words, or in general data components, will distort the actual frequency of occurrence and this may provide sufficient security.
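One way such a frequency disguise might be sketched is shown below, using four of the words and frequencies from Table 3. The function names, the choice of the lowest listed frequency (58) as the divisor, and the use of a seeded random generator are assumptions made for the example.

```python
import random


def codes_per_word(frequencies, base):
    """Number of codes per word, roughly proportional to frequency (minimum 1)."""
    return {word: max(1, round(freq / base)) for word, freq in frequencies.items()}


def build_tables(counts, rng):
    """Assign each word its share of consecutive codes, in randomized order."""
    total = sum(counts.values())
    pool = list(range(total))
    rng.shuffle(pool)                 # randomized ordering of the codes
    forward, reverse = {}, [None] * total
    position = 0
    for word, n in counts.items():
        forward[word] = pool[position:position + n]
        for code in forward[word]:
            reverse[code] = word      # every code decodes to exactly one word
        position += n
    return forward, reverse


def encode(words, forward, rng):
    """Pick one of a word's codes at random for each instance of the word."""
    return [rng.choice(forward[word]) for word in words]


rng = random.Random(0)
counts = codes_per_word({"of": 922, "the": 842, "and": 518, "on": 58}, base=58)
forward, reverse = build_tables(counts, rng)
coded = encode(["the", "of", "the", "on"], forward, rng)
decoded = [reverse[code] for code in coded]
```

Dividing by 58 reproduces the code counts listed in Table 3 for these four words, and decoding is unaffected because each code still maps back to a single word.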

Example Application for Data Storage

One possible application of these methods is to provide a web based service of secure data storage, access and analysis. The following describes one possible implementation of such a service.

-   1. A web based computer, or a virtual computer, referred to as a remote computer, houses a data server, accessible over the Internet.
-   2. A user contracts to use the service and store their data on the remote computer.
-   3. The user contracts to be able to:
    -   3.1. use the data server on the remote computer to house and maintain the data.
    -   3.2. provide access to the data to download data records when needed, and optionally provide data analysis and data search capabilities.
-   4. The server, or a special application, downloads a coding application, called the coder, to the client computer.
-   5. The coder enables a user to choose the data to be coded and housed on the remote computer.
-   6. The coder processes the designated data, creates the forward table and reverse table.
-   7. The coder codes the data components in each data item and optionally assigns an item name, an item unique identifier, an item location reference, and creates an item identification table.
-   8. The coder sends the coded data items, such as records or documents, to the data server in the remote location, in which all the data components of the data items are replaced by the corresponding number codes.
-   9. To minimize search times, the server may create an association matrix, storing the association between each code of a data component and each identifier of the item containing it.
-   10. The server may also create other useful indexes.

The server can be any database server, including a relational database server, a TIE, or any server implementing a faceted navigation type search system. The client can be any client able to communicate with the server and able to handle the translation of codes.

An alternative system can store the uncoded data on the client computer, identifying each item with a code, a location reference, and coding each data component as in the previous alternative. The data itself need not be coded (although it can be coded) and can reside entirely on the client computer or on the local area network. This arrangement can require less data transfer from client to remote server, while allowing the remote server to perform all searches and analysis.

Methods of Coding Data Components

A preferred embodiment can code every word in the data, including words in the field names and table names. Here by “words” we mean a sequence of alphanumeric characters starting at start of text or following a designated non-alphanumeric character (word breaking character) and ending at end of text or at a designated non-alphanumeric character (word breaking character). The word breaking characters can be chosen differently in different situations, as desired. The word breaking character here includes the possibility of a plural set of characters. Thus, for example, in a domain name such as uspto.gov the period can be considered a word breaking character, and coding applied to uspto and gov as two separate words. In certain situations it can be convenient to code both individual words and combinations of these as phrases.
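A minimal sketch of this word definition follows. Treating every non-alphanumeric ASCII character as a word breaking character is only one possible choice, and the names `split_words` and `code_words` are assumptions for the example.

```python
import re


def split_words(text, breakers=r"[^0-9A-Za-z]+"):
    """Split text into words at runs of word breaking characters."""
    return [word for word in re.split(breakers, text) if word]


def code_words(words, forward, reverse):
    """Code each word, assigning the next whole number to any new word."""
    codes = []
    for word in words:
        if word not in forward:
            forward[word] = len(reverse)  # the next unused whole number
            reverse.append(word)          # the list index is the number code
        codes.append(forward[word])
    return codes


forward, reverse = {}, []
codes = code_words(split_words("uspto.gov and uspto"), forward, reverse)
```

A different `breakers` pattern can be passed where other word breaking characters are wanted.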

Sometimes it may be convenient to have two different coding tables, one for data storage (Data Coding Table) and the other for searches (Search Coding Table). Then the forward Search Coding Table can code sets of synonyms of a word as the same code, allowing searches using a synonym to succeed, while the reverse coding table would not be needed for searching. This means that when a user enters any one of the synonyms, the same code can be used to search the data. The reverse coding table may only be needed for decoding the data in the data items, which can necessarily give only one data component, the literal one, for each code.

To allow the user to choose a synonym search or a literal search, the forward coding table can in fact be two coding tables (or one coding table but with two entries for each word having a synonym): one can allow only literal coding of data components, while the other can allow synonymous coding.

The synonym codes can be associated with every item containing any of the synonyms, while the literal codes can be associated only with items which contain the literal data component.
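The two forward tables and their associations might be sketched as below. The words, code values, and function names are hypothetical, and keeping the synonym codes in a separate numeric range from the literal codes is an assumption made so the two kinds of code can share one association table.

```python
# Hypothetical forward tables: literal codes are unique per word, while
# synonym codes are shared by every word in a synonym set.
literal_forward = {"car": 0, "automobile": 1, "bicycle": 2}
synonym_forward = {"car": 100, "automobile": 100, "bicycle": 101}

# Hypothetical data items, each listed with the words it contains.
items = {1: ["car"], 2: ["automobile"], 3: ["bicycle"]}


def build_associations(items):
    """Associate each item with both the literal and the synonym code of its words."""
    associations = {}
    for item_id, words in items.items():
        for word in words:
            associations.setdefault(literal_forward[word], set()).add(item_id)
            associations.setdefault(synonym_forward[word], set()).add(item_id)
    return associations


associations = build_associations(items)


def search(word, use_synonyms):
    """Look up matching items by the literal or the synonym code of a word."""
    table = synonym_forward if use_synonyms else literal_forward
    return associations.get(table[word], set())
```

With these tables, a literal search for "car" finds only item 1, while a synonym search also finds item 2.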

When creating the Data Coding Table for item content and encountering portions which are binary, the content can be split into some reasonable length data components, such as 10 character substrings. When creating a Search Coding Table in these cases, the coded data may need to be parsed into text or other useful data components before being coded.

Dealing with More Complex Data

The reverse code table, which acts as a key to the coded data, can be stored very compactly as long as it does not contain a great number of large data components. For example, if all the data is textual and the chosen data components do not comprise a large number of very long text strings, the reverse table stored on disk will be quite small, on the order of 100 MB. If however the data includes a very large number of pictures and/or large movies, and/or large numbers of very long text strings, the size of the reverse table can become large, its size dominated by the cumulative size of the long strings and graphics. In those databases where such large data is present, the data items, or just the very large data items, can be replaced in the reverse table by references, which specify the location of these data items. Such data items may then need to be stored on the local area network and, if security is needed, in a secure protected location. Alternatively they can be stored remotely, but coded in any convenient manner. For example, pictures and movies can be coded as binary files, with each group of binary components coded as a number and entered into the item coding table.

The case of formatted documents forming part or all of the data may be treated similarly to binary data. Alternatively, formatted text data can be completely coded using any number of similar methods. For example, the formatted document may be converted to an XML format, which clearly delimits textual content from the formatting. The current trend in fact is to use XML formats for all documents. For those, conversion may not be necessary. For others, it might be advisable to convert valuable legacy documents to XML so they remain accessible for a long time.

Such XML formatted documents can then treat the XML formatting tags as data components, distinguished in any convenient way from the text content data components. Searches on both the textual content and the formatting can use the formatting tags.

Aggregated Use of Multiple Servers

The number coding system can be used very effectively to search and access multiple databases using just one client and an aggregating server.

Assume that N servers, termed slave servers, are installed, each serving its own database using its own whole number coding system, referred to as local codes. An aggregating server is also installed and configured to serve any number of clients and to communicate with each of the slave servers. Each client can use a coordinated set of number codes, referred to as the global codes, to represent the query to the aggregating server. To build the global codes, each local code's data component can be checked and assigned a suitable global code. If the same data component appears in two or more sets of local codes, one global code can represent it. If some or all of the slave data sets contain the same kind of data, we can expect many data components to be shared among two or more sets of local codes.

Each slave server, or pre-processing application, can create the respective local codes for all its data components. Several ways can be used to communicate securely between the aggregating server and the slave servers.

One possible method, called the single code set method, creates the union set of all the slave server sets of data components and codes them using one set of codes defined by a single pair (i.e., forward and reverse) of code tables. To be able to use just one pair of code tables for all slave servers, each slave server's association matrices can be converted from the local codes to the equivalent global codes. For this conversion, a translation table for each slave server's codes can be created and used in the conversion.

For optimum performance in the translation process, each local to global translation table can be implemented as an array of code numbers, where each array element's index is the local code number of the data component and the element's value is the global code of the same data component. This can provide the fastest lookup performance during the conversion process.

Another method, called the local-to-global aggregation method, can require the conversion of codes from local to global and the reverse to be performed in real time by the aggregating server. For optimum performance, this can require two conversion tables for each slave server: one using array indexes for the local codes and the other using array indexes for the global codes.

Assuming the local-to-global aggregation method, the aggregating server performs the following basic functions:

-   1. receives a query from a client, expressed in terms of the global codes;
-   2. translates the global codes used in the query to N queries, one for each corresponding slave server, using the global-to-local conversion tables;
-   3. sends each translated copy of the query to the corresponding slave server;
-   4. receives the response from each slave server in terms of its local codes;
-   5. converts the response from each slave server to use the global codes;
-   6. aggregates the converted responses into one response by creating a union of the response codes for each part of the query;
-   7. sends the aggregated response to the client.
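The aggregating server's functions listed above can be sketched with two in-memory stand-ins for slave servers. Everything named here is a hypothetical illustration: the conversion arrays, the sample items, and `slave_search`, which simply returns the local codes found in items matching every query code.

```python
# Hypothetical conversion arrays for two slave servers.
# global_to_local[s][g] is the local code on slave s for global code g;
# local_to_global[s][l] is the global code for local code l on slave s.
global_to_local = [[2, 0, 1], [1, 2, 0]]
local_to_global = [[1, 2, 0], [2, 0, 1]]

# Hypothetical slave data: each item is a set of local codes.
slave_items = [
    [{2, 0}, {2, 1}],  # items held by slave 0
    [{1, 2}, {0}],     # items held by slave 1
]


def slave_search(slave, local_query):
    """Stand-in for a slave server: local codes in items matching all query codes."""
    found = set()
    for item in slave_items[slave]:
        if set(local_query) <= item:
            found |= item
    return found


def aggregate(global_query):
    """Run a query expressed in global codes against every slave server."""
    result = set()
    for slave in range(len(slave_items)):
        local_query = [global_to_local[slave][g] for g in global_query]  # step 2
        local_response = slave_search(slave, local_query)                # steps 3-4
        converted = {local_to_global[slave][c] for c in local_response}  # step 5
        result |= converted                                              # step 6
    return result                                                        # step 7
```

Both directions of conversion are plain array lookups, which is the layout the text suggests for optimum performance.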

When using the single code set method, step 2 of the above steps becomes unnecessary.

Data items can be similarly handled, except that it may be easier to create global coding even before creating the association matrices. For example, each slave server, or other local application, can determine the total number of local data items. Then each slave server, or the coding application, can be assigned a sufficient range of number codes to accommodate all data items. In that way translation from local to global and back would not be necessary.

The response to a query includes the unique identifiers of the matching items (item IDs). Matching item IDs are passed from a slave server to the aggregating server. Similarly, in order for a client to decode the coded content of data items, it should have the necessary code table. Therefore item IDs passed to the aggregating server, and passed by the aggregating server to the client as part of the response to the query, should make it possible to determine to which slave database each item belongs. This slave data information can be number coded, or coded in any other convenient way. If it is number coded and if the single code set method is used for data item codes, then the range in which the code number lies can determine which local slave server can locate the item.
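Determining the slave server from the range in which an item code lies might look like the sketch below. The range boundaries and the name `slave_for_item` are assumptions for the example.

```python
import bisect

# Hypothetical first item ID of the range assigned to each slave server,
# in ascending order: slave 0 gets 0-999, slave 1 gets 1000-4999, and
# slave 2 gets 5000 upward.
range_starts = [0, 1000, 5000]


def slave_for_item(item_id):
    """Return the index of the slave server whose range contains item_id."""
    return bisect.bisect_right(range_starts, item_id) - 1
```

Because each item ID is already unique across all slave servers in this arrangement, no local-to-global translation of item IDs is needed.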

The aggregating server can be in any location, either in the same location as the client or in some remote location. Because it is dealing only with numbers, security is reasonably assured.

After that, each separate data set can maintain its currency, following transactions, locally by adding codes to new data and possibly, though not necessarily, deleting obsolete data codes. To make the transactions available to all users of the aggregated data, either the aggregating server's conversion tables are updated, or the single code set is updated and securely transmitted to each client, after an update of any slave server data. For secure data, this update should be performed at a location secure from break-in.

During any transactions of data, supported by a slave server, that slave server's forward and reverse tables may need to be updated. Updating these tables will generally only be necessary if new whole number codes are created during the transactions. These new codes will only be needed when added data requires new data components not already coded with number codes. Secure transmission of the meaning of new codes can use individual character codes for security.

When a transaction involves the deletion of a data component, and there is no data remaining in the database which contains that data component, the number code assigned to that data component becomes available for other data components and can be re-used. Alternatively it can be retained in case the same data component is added in a future transaction.

When new codes of data components are added to a slave server, the aggregating server should be updated. The following describes a possible method of creating the aggregating server conversion tables.

The metadata associated with each slave server comprises the forward and reverse tables. We refer to these as the slave tables. One embodiment of the invention assigns codes in a slave table independently of those in another slave table. Such tables are termed uncoordinated tables.

In another embodiment of the invention, all of the slave tables are coordinated. This means that the same data components have the same codes in all slave tables. In another embodiment of the invention, not all but a plurality of slave tables are coordinated. In this embodiment some slave tables are coordinated while others are uncoordinated. These tables are called the partially coordinated tables.

Coordinated tables allow a more efficient aggregation process. However, creating a coordinated set of slave tables requires more effort and in some circumstances may not be practical.

Using a coordinated set of slave tables, the aggregating server need only send, to each slave server, copies of the query received from the client and then aggregate the responses and send them back to the client.

There are other possible methods of using parallel servers to have those servers process queries from a client. For example, it is possible to use the text version of the data components between the client and a securely located aggregating server. The query comprised of data components is then converted at the aggregating server's site to each slave server's codes and sent to each slave server using the appropriate codes.

The following is an example of methods that can be used to create the conversion tables allowing quick conversion of the slave codes to coordinated codes and vice versa. For fastest lookup of codes in each direction, two sets of translation tables can be used. One set, consisting of one table per slave server or a combined table, provides quick translation of a coordinated code to the associated slave codes. The other set, consisting of one table per slave server, provides the reverse lookup: quick translation of an uncoordinated code from a slave server to the coordinated one. One very efficient implementation of each such table is an array, where the array element index, or a simple offset of the index, is the code number being looked up.

TABLE 4

Data Component    UID Slave 1    UID Slave 2    UID Slave 3
blue              12             34             1
yellow            11             30             9
violet            9              12             40
green             15             44             6

TABLE 5

Data         Array index =     Slave 1    Slave 2    Slave 3
Component    coordinated ID    ID         ID         ID
yellow       1                 11         30         9
green        2                 15         44         6
blue         3                 12         34         1
violet       4                 9          12         40

Given a set of uncoordinated tables (UT), one method of creating the conversion table (CT) is as follows. Preferably, though not necessarily, start with the longest uncoordinated forward table (UFT), that is, the table for quickly looking up the uncoordinated ID (UID) of a given data component. This table can be stored in an associative array, that is a hash table, where the hash key is the data component and the value associated with that hash key is the UID. This starting table and its reverse, the uncoordinated reverse table (URT), will then be the source of the starting entries, contributed by each slave server data, to the coordinated table (CT). The entries in the other uncoordinated tables are then added, one at a time, to the coordinated table using any coordinated codes already assigned to a data component and adding sequential code numbers to any data component not yet assigned a coordinated code.

Both the forward and reverse tables can be used to make the aggregation process as fast as possible in both directions. Each kind of table can be implemented as an array where the index of the array element is the ID to be looked up. The forward table can be an array of data component vectors where the index of the array identifies the ID of the data component and that vector's components are the IDs of that data component as used on each of the respective slave servers. Table 5 illustrates such a structure where each row is a vector and the first cell in each vector is the array index which is the ID of the coordinated data component.

For example, the aggregating server receives a query, using data component IDs, from a client which uses only coordinated IDs. It uses the forward table in which the row number is the coordinated ID, CID, and the values in the row (that is, components of the respective vector) are the respective slave server local IDs. Thus the aggregating server uses the several local IDs and converts the query into the several separate local queries.
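The construction of coordinated tables from uncoordinated ones can be sketched using the UIDs of Table 4. The function name, the returned structures, and the use of Python dictionaries for the per-slave reverse lookup are assumptions for the example. Because the order of assignment is arbitrary, the coordinated IDs produced here differ from those shown in Table 5.

```python
# Uncoordinated forward tables (data component -> UID), taken from Table 4.
slave_forward_tables = [
    {"blue": 12, "yellow": 11, "violet": 9, "green": 15},   # slave 1
    {"blue": 34, "yellow": 30, "violet": 12, "green": 44},  # slave 2
    {"blue": 1, "yellow": 9, "violet": 40, "green": 6},     # slave 3
]


def build_coordinated(tables):
    """Assign sequential coordinated IDs (CIDs) and build both lookup directions."""
    coordinated = {}  # data component -> CID, numbered from 1
    # Start with the longest table, then add the others one at a time.
    order = sorted(range(len(tables)), key=lambda s: -len(tables[s]))
    for s in order:
        for component in tables[s]:
            if component not in coordinated:
                coordinated[component] = len(coordinated) + 1
    # Forward structure: array index = CID, row = UID on each slave server
    # (None where a slave lacks the component); index 0 is unused.
    rows = [None] * (len(coordinated) + 1)
    for component, cid in coordinated.items():
        rows[cid] = [table.get(component) for table in tables]
    # Reverse structure: one UID -> CID lookup per slave server.
    uid_to_cid = [
        {uid: coordinated[component] for component, uid in table.items()}
        for table in tables
    ]
    return coordinated, rows, uid_to_cid


coordinated, rows, uid_to_cid = build_coordinated(slave_forward_tables)
```

Where UIDs are dense whole numbers, the per-slave dictionaries could be replaced by arrays indexed by UID, as the text suggests for fastest lookup.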

The nature of the response of the slave servers to a query depends on the type of implementation and type of databases used. If a normal relational database system is used, the response to the query is a list of IDs of the matching items. These IDs can be coordinated between the servers (for example, by assigning a range of IDs for each local server to use) or they can use independent uncoordinated IDs for the items, and then the aggregating server will need to convert these IDs to the coordinated set and return the result to the client.

If the databases use faceted navigation, such as the TIE database, then the aggregating server has the task of translating and uniting the list of available data components returned by each server. The available data components are those which are associated with any one of the matching items.

Overview of Client-Server Number Coded Process

The details of the system organization for security of the data depend on the specifics of the application and the environment. The following example outlines some general features and parts.

Number Coded Secure Data

In this example the following applies:

-   1. All data is coded.
-   2. The client computer is secure from intrusion.
-   3. The coding key is stored on a small flash plug which can be stored in a safe when not in use.
-   4. The flash plug with the coding key is plugged into the client computer to enable access to the server and data.
-   5. The coding key can be zipped and passworded.
-   6. When a number of databases are in an organization, each will have its coding key and each coding key can be identified by a unique number, associating it with the database which uses it. That association, which can be hidden from any intruder, provides an additional safety layer.

The number coding secure system has the following parts:

-   1. Server which uses only the number codes.
-   2. The Coder which:
    -   2.1. Creates the assignment of numbers to elements of data.
    -   2.2. Codes each field value in each record in the data and each field name.
    -   2.3. Creates the new coded records, allowing the originals to be moved to a secure place as backups.
    -   2.4. Compresses and passwords the coding key and outputs it to a flash drive.
-   3. The Code Interpreter which uses the key and:
    -   3.1. Converts all queries from their textual to their coded form.
    -   3.2. Converts all responses from the server to their text form for the client to display.
    -   3.3. Converts any data records requested by the user from their coded form to their text form.

The Code Interpreter in many cases can be integrated into the client code. The Coder is usually best created as a separate application.

The above steps are preferred whether a GIA client and server are used or the steps are implemented as an add-on to a relational database.

Although the invention has been discussed with respect to various embodiments, it should be recognized that the invention comprises the novel and non-obvious claims supported by this disclosure.

What is claimed is:
 1. A computer implemented method of providing secure data storage, the data comprised of data items which are comprised of data components, a plurality of the data components being comprised of a plurality of text characters, the method comprising: coding at least all data components needing secure storage such that each unique data component of a plurality of data components is assigned a unique code unrelated to the semantic meaning of the data component; storing the data using the coded data components; ensuring that decoding a coded data component is not needed to search for it; ensuring that to replace each code with a corresponding data component requires a table with at least as many code entries as there are codes used.
 2. The method of claim 1 wherein each code is a number, the method further comprising: creating a code table storing the coded data components' codes; and creating an association matrix storing associations of data items with the number codes of the data components.
 3. The method of claim 2 wherein the number codes are sequential numbers.
 4. The method of claim 2 wherein a majority of the coded data components are comprised of a plurality of text characters.
 5. The method of claim 2 wherein a whole number code of a data component in the code table is arithmetically related to the table row number which contains the data component.
 6. The method of claim 2 where the code table comprises a list of data components wherein the number code of each data component is the list item number.
 7. The method of claim 5 wherein the code table is implemented in a software program as an array of vectors.
 8. The method of claim 1 further comprising performing coding of the data components on the client.
 9. The method of claim 8 further comprising transferring the coded data to the server.
 10. A computer implemented method of providing data storage and search services, using a client-server system, in which a client computer is in a first location and a server computer is in a second location, the database comprised of data items which are comprised of data components, the method comprising: choosing a plural set of data components for coding in the first location; assigning a number code to each of the chosen data components in the first location; assigning identifiers to each of a plurality of data items in the first location; in the first location, creating a code table for converting each coded data component's assigned number code to the data component, such that the number code is arithmetically related to the number of the table row which contains the data component; and storing the number codes at the second location; wherein the code table is stored in a location other than the second location.
 11. The method of claim 10 wherein a reverse table is implemented as an array of data components, in which a data component is stored in the array at an index arithmetically related to the data component's whole number code.
 12. The method of claim 10 wherein data items are stored coded using number codes and assignment of a number code to a data component is made in such a way as to change the true frequency of occurrence of the data component within the data.
 13. The method of claim 10 wherein a data component is assigned a plurality of number codes.
 14. A computer implemented method of coding data by assigning whole number codes to data components of data items, the method comprising: accepting input of a data component; comparing the data component to other data components that have already been coded; assigning a whole number code to the data component; storing the data component and its code; performing a search for a data component without decoding the data and without adding any performance overhead as compared with searches through uncoded data.
 15. The method of claim 14 further comprising: determining a count of the number of times the data component appears in the data; assigning a plurality of different whole number codes to a data component such that the number of assigned codes to a data component is adjusted to hide the true frequency of occurrence of the data component in the data.
 16. The method of claim 14 further comprising: determining a count of the number of times the data component appears in the data; assigning a plurality of different whole number codes to a data component such that the number of assigned codes to a data component is adjusted to hide the true frequency of occurrence of the data component in the data.